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Abstract 

Estimation of structure, such as in variable selection, graphical modelling or cluster 
analysis is notoriously difficult, especially for high-dimensional data. We introduce 
stability selection. It is based on subsampling in combination with (high-dimensional) 
selection algorithms. As such, the method is extremely general and has a very wide 
range of applicability. Stability selection provides finite sample control for some error 
rates of false discoveries and hence a transparent principle to choose a proper amount 
of regularisation for structure estimation. Variable selection and structure estimation 
improve markedly for a range of selection methods if stability selection is applied. We 
prove for randomised Lasso that stability selection will be variable selection consistent 
even if the necessary conditions needed for consistency of the original Lasso method 
are violated. We demonstrate stability selection for variable selection and Gaussian 
graphical modelling, using real and simulated data. 

1 Introduction 

Estimation of discrete structure, such as graphs or clusters, or variable selection is an age-old 
problem in statistics. It has enjoyed increased attention in recent years due to the massive growth 
of data across many scientific disciplines. These large datasets often make estimation of discrete 
structures or variable selection imperative for improved understanding and interpretation. Most 
classical results do not cover the loosely defined case of high-dimensional data, and it is mainly in 
this area where we motivate the promising properties of our new stability selection. 

In the context of regression, for example, an active area of research is to study the p 3> n case, 
where the number of variables or covariates p exceeds the number of observations n; for an early 
overview see for example van de Geer and van Houwelingen (2004). In a similar spirit, graphical 
modelling with many more nodes than sample size has been the focus of recent research, and cluster 
analysis is another widely used technique to infer a discrete structure from observed data. 

Ghallenges with estimation of discrete structures include computational aspects, since corre- 
sponding optimisation problems are discrete, as well as determining the right amount of regularisa- 
tion, for example in an asymptotic sense for consistent structure estimation. Substantial progress 
has been made over the last years in developing computationally tractable methods which have 
provable statistical (asymptotic) properties, even for the high-dimensional setting with many more 
variables than samples. One interesting stream of research has focused on relaxations of some 
discrete optimisation problems, for example by ^i-penalty approaches (Donoho and Elad, 2003; 
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Meinshausen and Biihlmann, 2006; Zhao and Yu, 2006; Wainwright, 2006; Yuan and Lin, 2007) or 
greedy algorithms (Freund and Schapire, 1996; Tropp, 2004; Zhang, 2009). The practical usefulness 
of such procedures has been demonstrated in various applications. However, the general issue of 
selecting a proper amount of regularisation (for the procedures mentioned above and for many oth- 
ers) for getting a right-sized structure or model has largely remained a problem with unsatisfactory 
solutions. 

We address the problem of proper regularisation with a very generic subsampling approach 
(bootstrapping would behave similarly). We show that subsampling can be used to determine the 
amount of regularisation such that a certain faniilywisc type I error rate in multiple testing can be 
conservatively controlled for finite sample size. Particularly for complex, high-dimensional prob- 
lems, a finite sample control is much more valuable than an asymptotic statement with the number 
of observations tending to infinity. Beyond the issue of choosing the amount of regularisation, the 
subsampling approach yields a new structure estimation or variable selection scheme. For the more 
specialised case of high-dimensional linear models, we prove what we expect in greater general- 
ity: namely that subsampling in conjunction with £i-penalised estimation requires much weaker 
assumptions on the design matrix for asymptotically consistent variable selection than what is 
needed for the (non-subsampled) £i-penalty scheme. Furthermore, we show that additional im- 
provements can be achieved by randomising not only via subsampling but also in the selection 
process for the variables, bearing sonic resemblance to the successful tree-based Random Forest 
algorithm (Breiman, 2001). Subsampling (and bootstrapping) has been primarily used so far for 
asymptotic statistical inference in terms of standard errors, confidence intervals and statistical test- 
ing. Our work here is of a very different nature: the marriage of subsampling and high-dimensional 
selection algorithms yields finite sample familywise error control and markedly improved structure 
estimation or selection methods. 

1.1 Preliminaries and examples 

In general, let /? be a p-dimensional vector, where /3 is sparse in the sense that s < p components 
are non-zero. In other words, ||/3||o = s < p. Denote the set of non-zero values by S* = {A; : 7^ 0} 
and the set of variables with vanishing coefficient hy N = {k : Pk = 0}. The goal of structure 
estimation is to infer the set S from noisy observations. 

As a first supervised example, consider data {X^^\Y^^^), . . . ,{X^"\Y^'^^) with univariate re- 
sponse variable Y and p-dimensional covariates X. We typically assume that (XW,fW)'s are 1.1. 
distributed. The vector /3 could be the coefficient vector in a linear model 

Y = Xp + s, (1) 

where Y = (Yi, . . . , y„), X is the n x p design matrix and £ = (ei, . . . , e„) is the random noise 
whose components are independent, identically distributed. Thus, inferring the set S from data is 
the well-studied variable selection problem in linear regression. A main stream of classical methods 
proceeds to solve this problem by penalising the negative log-likelihood with the ^o-norm ||/3||o 
which equals the number of non-zero components of (3. The computational task to solve such 
an ^o-norm penalised optimisation problem becomes quickly unfeasible if p is getting large, even 
when using efficient branch and bound techniques. Alternatively, one can relax the ^o-norm by the 
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^1-norm penalty. This leads to the Lasso estimator (Tibshirani, 1996; Chen et al., 2001), 

p 

= argmin^g^p \\Y - Xpg + Xj^l^kl (2) 

k=l 

where A G M"*" is a regularisation parameter and we typically assume that the covariates are on 
the same scale, i.e. ||Xfc||2 = Y17=ii-^k ^)'^ ~ attractive feature of Lasso is its computational 

feasibility for large p since the optimisation problem in (2) is convex. Furthermore, the Lasso is able 
to select variables by shrinking certain estimated coefficients exactly to 0. We can then estimate 
the set S of non-zero P coefficients by = {k; (3^ ^ 0} which involves convex optimisation only. 
Substantial understanding has been gained over the last few years about consistency of such Lasso 
variable selection (Meinshausen and Biihlmann, 2006; Zhao and Yu, 2006; Wainwright, 2006; Yuan 
and Lin, 2007), and we present the details in Section 3.1. Among the challenges are the issue of 
choosing a proper amount of regularisation A for consistent variable selection and the fact that 
restrictive design conditions are needed for asymptotically recovering the true set S of relevant 
covariates. 

A second example is on unsupervised Gaussian graphical modelling. The data is assumed to be 

X«,...,xWi.i.d. ~AAd(M,S). (3) 

The goal is to infer conditional dependencies among the d variables or components in X = 
{Xi, . . . ,Xd). It is well-known that Xj and X^ are conditionally dependent given all other com- 
ponents {X(^£y, £ / j, k} if and only if 7^ 0, and we then draw an edge between nodes j and 
k in a corresponding graph (Lauritzen, 1996). The structure estimation is thus on the index set 
G = {{j,k); 1 < j < k < d} which has cardinality p = (2) (and of course, we can represent Q 
as a p X 1 vector) and the set of relevant conditional dependencies is S" = {{j, k) G G', 7^ 0}. 
Similarly to the problem of variable selection in regression, ^o-norm methods are computationally 
very hard and become very quickly unfeasible for moderate or large values of d. A relaxation with 
^i-type penalties has also proven to be useful in this context (Meinshausen and Biihlmann, 2006). 
A recent proposal is the graphical Lasso (Friedman et al., 2008): 

= argmine nonneg.def.{- log(det(e)) + tr(Se) + A^ |e,fe|}. (4) 

j<k 

This amounts to an ^i-penalised estimator of the Gaussian log-likelihood, partially maximised 
over the mean vector fi, when minimising over all nonnegative definite symmetric matrices. The 
estimated graph structure is then = {{j,k) G Q; Qjj^ 7^ 0} which involves convex optimisation 
only and is computationally feasible for large values of d. 

Another potential area of application is clustering. Choosing the correct number of cluster 
is a notoriously difficult problem. Looking for clusters that are stable under perturbations or 
subsampling of the data can help to get a better sense of a meaningful number of clusters and to 
validate results. Indeed, there has been some activity in this area, most notably in the context of 
consensus clustering (Monti et al., 2003). For an early application see Bhattacharjee et al. (2005). 
Our proposed false discovery control can be applied to consensus clustering, yielding good estimates 
of the parameters of a suitable base clustering method for consensus clustering. 
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1.2 Outline 



The use of resampling for purposes of validation is certainly not new; we merely try to put it into 
a more formal framework and to show certain empirical and theoretical advantages of doing so. It 
seems difficult to give a complete coverage of all previous work in the notions of stability, 

resampling and perturbations are very natural in the context of structure estimation and variable 
selection. We reference and compare with previous work throughout the paper. 

The structure of the paper is as follows. The generic stability selection approach, its familywise 
type I multiple testing error control and some representative examples from high-dimensional linear 
models and Gaussian graphical models are presented in Section 2. A detailed asymptotic analysis 
of Lasso and randomised Lasso for high-dimensional linear models is given in Section 3 and more 
numerical results are described in Section 4. After a discussion in Section 5, we collect all the 
technical proofs in the Appendix. 

2 Stability selection 

Stability selection is not a new variable selection technique. Its aim is rather to enhance and improve 
existing methods. First, we give a general description of stability selection and we present specific 
examples and applications later. We assume throughout this Section 2 that the data, denoted 
here by Z^^\ . . . , Z^'^\ are independent and identically distributed (e.g. = (xW,yW) ^ith 
covariate X(') and response 

For a generic structure estimation or variable selection technique, we have a tuning parameter 
A G A C M+ that determines the amount of regularisation. This tuning parameter could be the 
penalty parameter in ^i-penalised regression, see (2), or in Gaussian graphical modelling, see (4); 
or it may be number of steps in forward variable selection or Orthogonal Matching Pursuit (Mallat 
and Zhang, 1993) or the number of iterations in Matching Pursuit (Mallat and Zhang, 1993) or 
Boosting (Freund and Schapire, 1996); a large number of steps of iterations would have an opposite 
meaning from a large penalty parameter, but this does not cause conceptual problems. For every 
value A G A, we obtain a structure estimate 5'^ C {1, . . . It is then of interest to determine 
whether there exists a A G A such that 5^ is identical to S with high probability and how to achieve 
that right amount of regularisation. 

2.1 Stability paths 

We motivate the concept of stability paths in the following, first for regression. Stability paths are 
derived from the concept of regularisation paths. A regularisation path is given by the coefficient 
value of each variable over all regularisation parameters: {/?^; A G A, k = 1, . . . ,p}. Stability paths 
(defined below) are, in contrast, the probability for each variable to be selected when randomly 
resampling from the data. For any given regularisation parameter A G A, the selected set is 
implicitly a function of the samples / = {l,...,n}. We write = S^{I) where necessary to 
express this dependence. 

Definition 1 (Selection probabilities) Let I be a random subsample o/{l, . . . , n} of size [n/2j , 
drawn without replacement. For every set K C. {1, . . . the probability of being in the selected 
set S^{I) is 

= P*{K C S^{I)). (5) 
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Figure 1: Left: The Lasso path for the vitamin gene- expression dataset. The paths of the 6 non- 
permuted genes are plotted as solid, red lines, while the paths of the 4082 permuted genes are shown 
as broken, black lines. Selecting a model with all 6 unpermuted genes invariably means selecting a 
large number of irrelevant noise variables. Middle: the stability path of Lasso. The first 4 variables 
chosen with stability selection are truly non-permuted variables. Right: The stability path for the 
'randomised Lasso' with weakness a = 0.2, introduced in Section 3.1. Now all 6 non-permuted 
variables are chosen before any noise variable enters the model. 

Remark 1 The probability P* in (5) is with respect to both the random subsampling (and other 
sources of randomness if S"^ is a randomised algorithm, see Section 3.1). 

The sample size of \ri/2\ is chosen as it resembles most closely the bootstrap (Preedman, 1977; 
Biihlmann and Yu, 2002) while allowing computationally efficient implementation. Subsampling 
has also been advocated in a related context in Valdar et al. (2009). 

For every variable k = 1, . . . ,p, the stability path is given by the selection probabilities 11^, 
A G A. It is a complement to the usual path- plots that show the coefficients of all variables 
k = 1, . . . ,p as a function of the regularisation parameter. It can be seen in Figure 1 that this 
simple path plot is potentially very useful for improved variable selection for high-dimensional data. 

In the remainder of the manuscript, we look at the selection probabilities of individual variables. 
The definition above covers also sets of variables. We could monitor the selection probability of 
a set of functionally related variables, say, by asking how often at least one variable in this set is 
chosen or how often all variables in the set are chosen. 

2.2 Example I: Variable selection in regression 

We apply stability selection to the Lasso defined in (2). We work with a gene expression dataset 
for illustration which is kindly provided by DSM Nutritional Products (Switzerland). For n = 115 
samples, there is a continuous response variable measuring the logarithm of riboflavin (vitamin 
B2) production rate of Bacillus Subtilis, and we have p = 4088 continuous covariates measuring 
the logarithm of gene expressions from essentially the whole genome of Bacillus Subtilis. Certain 
mutations of genes are thought to lead to higher vitamin concentrations and the challenge is to 



identify those relevant genes via a linear regression analysis. That is, we consider a linear model as 
in (1) and want to infer the set S = {k; fik 7^ 0}. 

Instability of the selected set of genes has been noted before (Ein-Dor et al., 2005; Michiels et al., 
2005), if either using marginal association or variable selection in a regression or classification model. 
Davis et al. (2006) are close in spirit to our approach by arguing for 'consensus' gene signatures 
which assess the stability of selection, while Zucknick et al. (2008) propose to measure stability of 
so-called 'molecular profiles' by the Jaccard index. 

To see how Lasso and the related stability path cope with noise variables, we randomly permute 
all but 6 of the 4088 gene expression across the samples, using the same permutation to keep the 
dependence structure between the permuted gene expressions intact. The set of 6 unpermuted 
genes has been chosen randomly among the 200 genes with the highest marginal association with 
the response. The Lasso path {(5^] A G A} is shown in the left panel of Figure 1, as a function of the 
regularisation parameter A (rescaled so that A = 1 is the minimal A-value for which the null model 
is selected and A = amounts to the Basis Pursuit solution). Three of the 'relevant' (unpermuted) 
genes stand out, but all remaining three variables are hidden within the paths of noise (permuted) 
genes. The middle panel of Figure 1 shows the stability path. At least four relevant variables stand 
out much clearer now than they did in the regularisation path plot. The right panel shows the 
stability plot for randomised Lasso which will be introduced in Section 3.1: now all 6 unpermuted 
variables stand above the permuted variables and the separation between (potentially) relevant 
variables and irrelevant variables is even better. 

Choosing the right regularisation parameter is very difficult for the original path. The prediction 
optimal and cross- validated choice include too many variables (Meinshausen and Biihlmann, 2006; 
Leng et al., 2006) and the same effect can be observed in this example, where 14 permuted variables 
are included in the model chosen by cross-validation. Figure 1 motivates that choosing the right 
regularisation parameter is much less critical for the stability path and that we have a better chance 
to select truly relevant variables. 

2.3 Stability selection 

In a traditional setting, variable selection would amount to choosing one element of the set of 
models 

{S\ AeA}, (6) 

where A is again the set of considered regularisation parameters, which can be either continuous 
or discrete. There are typically two problems: first, the correct model S might not be a member of 
(6). Second, even if it is a member, it is typically very hard for high-dimensional data to determine 
the right amount of regularisation A to select exactly S, or to select at least a close approximation. 

With stability selection, we do not simply select one model in the list (6). Instead the data are 
perturbed (for example by subsampling) many times and we choose all structures or variables that 
occur in a large fraction of the resulting selection sets. 

Definition 2 (Stable vsiriables) For a cutoff irthr with < TTthr < 1 and a set of regularisation 
parameters A, the set of stable variables is defined as 

^stable ^ 1^ . jnaxfl^ > TTthr}- (7) 

AeA 
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Wc keep variables with a high selection probability and disregard those with low selection 
probabilities. The exact cutoff nthr with < nthr < 1 is a tuning parameter but the results vary 
surprisingly little for sensible choices in a range of the cutoff. Neither do results depend strongly 
on the choice of regularisation A or the regularisation region A. See Figure 1 for an example. 

Before we present some guidance on how to choose the cutoff parameter and the regularisation 
region A below, it is worthwhile pointing out that there have been related ideas in the literature on 
Bayesian model selection. Barbieri and Berger (2004) show certain predictive optimality results for 
the so-called median probability model, consisting of variables which have posterior probability of 
being in the model of 1/2 or greater (as opposed to choosing the model with the highest posterior 
probability). Lee et al. (2003) or Sha et al. (2004) are examples of more applied papers considering 
Bayesian variable selection in this context. 



2.4 Choice of regularisation and error control 

When trying to recover the set S, a natural goal is to include as few variables of the set N of noise 
variables as possible. The choice of the regularisation parameter is hence crucial. An advantage 
of our stability selection is that the choice of the initial set of regularisation parameters A has 
typically not a very strong influence on the results, as long as A is varied with reason. Another 
advantage, which we focus on below, is the ability to choose this set of regularisation parameters 
in a way that guarantees, under stronger assumptions, a certain bound on the expected number of 
false selections. 

Definition 3 (Additional notation) Let = U^gAS'^ be the set of selected structures or vari- 
ables if varying the regularisation A in the set A. Let be the average number of selected variables, 
qA = E{\S^{I)\). Define V to be the number of falsely selected variables with stability selection, 

V = |ivn s**"*'^!- 

In general, it is very hard to control E(y), as the distribution of the underlying estimator /? depends 
on many unknown quantities. Exact control is only possible under some simplifying assumptions. 

Theorem 1 (Error control) Assume that the distribution of {l^j^^gxj,k £ N} is exchangeable 
for all A G A. Also, assume that the original procedure is not worse than random guessing, i.e. for 
any A G A, 

E{\sns^\) ^ \s[ 

E{\Nns^\) ~ \N\' ^ ' 

The expected number V of falsely selected variables is then bounded by 

E{V) < ^ ^ ^ ^. (9) 

We will discuss below how to make constructive use of the value q\ which is in general an unknown 
quantity. The expected number of falsely selected variables is sometimes called the per-family error 
rate {PFER) or, if divided by p, the per-comparison error rate {FCER) in multiple testing (Dudoit 
et al., 2003). Choosing less variables (reducing ^a) or increasing the threshold -Kthr for selection 
will, unsurprisingly, reduce the the expected number of falsely selected variables, with a minimal 
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achievable non-trivial value of (for TTj/jr = 1 ^^^id q\ = 1) for the PFER. This seems low enough 
for all practical purposed as long as p > 10, say. 

The involved exchangeability assumption is perhaps stronger than one would wish, but there 
does not seem to be a way of getting error control in the same generality without making similar 
assumptions. For regression in (1), the exchangeability assumption is fulfilled for all reasonable 
procedures S if the design is random and the distribution of {X^^k G N} is exchangeable. Inde- 
pendence of all variables in A*" is a special case. More generally, the variables could have a joint 
normal distribution with Cov(Xfc, X/) = p for all k,l e N with k ^ I and < p < 1. For real data, 
wc have no guarantee that the assumption is fulfilled but the numerical examples in Section 4 show 
that the bound holds up very well for real data. 

Note also that the assumption of exchangeability is only needed to prove Theorem 1. All 
other benefits of stability selection shown in this paper do not rely on this assumption. Besides 
exchangeability, we needed another, quite harmless, assumption, namely that the original procedure 
is not worse than random guessing. One would certainly hope that this assumption is fulfilled. If 
it is not, the results below are still valid with slightly weaker constants. The assumption seems so 
weak, however, that we do not pursue this further. 

The threshold value irthr is a tuning parameter whose influence is very small. For sensible 
values in the range of, say, iTthr € (0.6,0.9), results tend to be very similar. Once the threshold is 
chosen at some default value, the regularisation region A is determined by the desired error control. 
Specifically, for a default cutoff value nthr = 0.9, choosing the regularisation parameters A such 
that say qa = ^/0.8p will control E{V) < 1; or choosing A such that qa = ^/0.8 ap controls the 
familywise error rate (FWER) at level a, i.e. P(y > 0) < a. Of course, we can proceed the other 
way round by fixing the regularisation region A and choosing irthr such that E{V) is controlled at 
the desired level. 

To do this, we need knowledge about qa- This can be easily achieved by regularisation of the 
selection procedure S = S'' in terms of the number of selected variables q. That is, the domain A 
for the regularisation parameter A determines the number q of selected variables, i.e. q = q{A). For 
example, with £i-norm penalisation as in (2) or (4), the number q is given by the variables which 
enter first in the regularisation path when varying from a maximal value Amax to some minimal 
value Xmin- Mathematically, Amin is such that | UA^^^>A>Amin S^\ < q. 

Without stability selection, the regularisation parameter A invariably has to depend on the 
unknown noise level of the observations. The advantage of stability selection is that (a) exact error 
control is possible, and (b) the method works fine even though the noise level is unknown. This is 
a real advantage in high-dimensional problems with n, as it is very hard to estimate the noise 
level in these settings. 

Pointwise Control. For some applications, evaluation of subsampling replicates of are already 
computationally very demanding for a single value of A. If this single value A is chosen such that 
some overfitting occurs and the set is rather too large, in the sense that it contains S with 
high probability, the same approach as above can be used and is in our experience very successful. 
Results typically do not depend strongly on the utilised regularisation A. See the example below 
for graphical modelling. Setting A = {A}, one can immediately transfer all results above to the 
case of what we call here pointwise control. For methods which select structures incrementally, i.e. 
for which C S'^ for all A > A', pointwise control and control with A = [A,oo) are equivalent 
since 11^ is then monotonically increasing with decreasing A for all A; = 1 , . . . , p. 
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^ = 0.46 X = 0.448 ^ = 0.436 ^ = 0.424 X = 0A^2 X = OA 




Figure 2: Vitamin gene-expression dataset. The regularisation path of graphical lasso (top row) 
and the corresponding point-wise stability selected models (bottom row). 

2.5 Example II: Graphical modelling 

Stability selection is also promising for graphical modelling. Here we focus on Gaussian graphical 
models as described in Section 1.1 around formula (3) and (4). 

The pattern of non-zero entries in the inverse covariance matrix corresponds to the edges 
between the corresponding pairs of variables in the associated graph and is equivalent to a non-zero 
partial correlation (or conditional dependence) between such pairs of variables (Lauritzen, 1996). 

There has been interest recently in using £i-penalties for model selection in Gaussian Graphical 
models due to their computational efficiency for moderate and large graphs (Meinshausen and 
Buhlmann, 2006; Yuan and Lin, 2007; Friedman et al., 2008; Banerjee and El Ghaoui, 2008; Bickel 
and Levina, 2008; Rothman ct al., 2008). Here we work with the graphical Lasso (Friedman et al., 
2008), as applied to the data from 160 randomly selected genes from the vitamin gene-expression 
dataset (without the response variable) introduced in Section 2.2. We want to infer the set of 
non-zero entries in the inverse covariance matrix Part of the resulting regularisation path 

of the graphical Lasso showing graphs for various values of the regularisation parameter A, i.e. 
{S'^; A G A} where S*^ = {(j, k); (S~^)j*^ 7^ 0}, are shown in the first row of Figure 2. For reasons 
of display, variables (genes) are ordered first using hierarchical clustering and are symbolised by 
nodes arranged in a circle. Stability selection is shown in the bottom row of Figure 2. We pursue 
a pointwise control approach. For each value of A, we select the threshold nthr so as to guarantee 
E{V) < 30, that is we expect fewer than 30 wrong edges among the 12720 possible edges in the 
graph. The set 5**"*'^ varies remarkably little for the majority of the path and the choice of q 
(which is implied by A) does not seem to be critical, as already observed for variable selection in 
regression. 

Next, we permute the variables (expression values) randomly, using a different permutation for 
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X = 0.065 X = 0.063 X = 0.06^ ^ = 0.059 ^ = 0.057 ?i = 0.055 




Figure 3: The same plot as in Figure 2 hut with the variables (expression values of each gene) 
permuted independently. The empty graph is the true model. With stability selection, only a few 
errors are made, as guaranteed by the made error control. 

each variable (gene). The true graph is now the empty graph. As can be seen from Figure 3, 
stabihty selection selects now just very few edges or none at all (as it should). The top row shows 
the corresponding graphs estimated with the graphical Lasso which yields a much poorer selection 
of edges. 

2.6 Computational requirements 

Stability selection demands to re-run {S^; A G A} multiple times. Evaluating selection probabilities 
over 100 subsamples seems sufficient in practice. The algorithmic complexity of Lasso in (2) or in 
(13) below is of the order 0(npmin{n,p}), see Efron et al. (2004). In the p > n regime, running 
the full Lasso path on subsamples of size ra/2 is hence a quarter of the cost of running the algorithm 
on the full dataset and running 100 simulations is 25 times the cost of running a single fit on the 
full dataset. This cost could be compared with the cost of cross-validation, as this is what one 
has to resort to often in practice to select the regularisation parameter. Running 10-fold cross- 
validation uses approximately 10 • 0.9^ = 8.1 as many computational resources as the single fit on 
the full dataset. Stability selection is thus roughly three times more expensive than 10-fold CV. 
This analysis is based on the fact that the computational complexity scales like O(n^) with the 
number of observations (assuming p > n). If computational costs would scale linearly with sample 
size (e.g. for Lasso with p < n), this factor would increase to roughly 5.5. 

Stability selection with the Lasso (using 100 subsamples) for a dataset with p = 1000 and 
n = 100 takes about 10 seconds on a 2.2GHz processor, using the implementation of Friedman 
et al. (2007). Computational costs of this order would often seem worthwhile, given the potential 
benefits. 
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3 Consistent variable selection 



Stability selection is a general technique, applicable to a wide range of applications, some of which 
we have discussed above. Here, we want to discuss advantages and properties of stability selection 
for the specific application of variable selection in regression with high-dimensional data which is a 
well-studied topic nowadays (Meinshausen and Biihlmann, 2006; Zhao and Yu, 2006; Wainwright, 

2006) . We consider a linear model as in (1) with Gaussian noise, 

Y = Xp + e, (10) 

with fixed n x p design matrix X and £!,...,£„ i.i.d. A/'(0, cr^). The predictor variables are nor- 
malised with \\Xk\\2 = {J27=i(-^k ^)'^y^^ = 1 for all A; G {1, . . . ,p}. We allow for high-dimensional 

settings where p ^ n. 

Stability selection is attractive for two reasons. First, the choice of a proper regularisation 
parameter for variable selection is crucial and notoriously difficult, especially because the noise 
level is unknown. With stability selection, results are much less sensitive to the choice of the 
regularisation. Second, we will show that stability selection makes variable selection consistent in 
settings where the original methods fail. 

We give general conditions under which consistent variable selection is achieved with stability 
selection. Consistent variable selection for a procedure S is understood to be equivalent to 

P{S = S)^1 n^oo. (11) 

It is clearly of interest to know under which conditions consistent variable selection can be achieved. 
In the high-dimensional context, this places a restriction on the growth of the number p of variables 
and sparsity \S\, typically of the form liSI • logp = o(n) (Meinshausen and Biihlmann, 2006; Zhao 
and Yu, 2006; Wainwright, 2006). While this assumption is often realistic, there are stronger 
assumptions on the design matrix that need to be satisfied for consistent variable selection. For 
Lasso, it amounts to the 'neighbourhood stability' condition (Meinshausen and Biihlmann, 2006) 
which is equivalent to the 'irrepresentable condition' (Zhao and Yu, 2006; Zou, 2006; Yuan and Lin, 

2007) . For Orthogonal Matching Pursuit (which is essentially forward variable selection), the so- 
called 'exact recovery criterion' (Tropp, 2004; Zhang, 2009) is sufficient and necessary for consistent 
variable selection. 

Here, we show that these conditions can be circumvented more directly by using stability se- 
lection, also giving guidance on the proper amount of regularisation. Due to the restricted length 
of the paper, we will only discuss in detail the case of Lasso whereas the analysis of Orthogonal 
Matching Pursuit is just indicated. 

An interesting aspect is that stability selection with the original procedures alone yields often 
very large improvements already. Moreover, when adding some extra sort of randomness in the 
spirit of Random Forests (Brciman, 2001) weakens considerably the conditions needed for consistent 
variables selection as discussed next. 

3.1 Lasso and randomised Lasso 

The Lasso (Tibshirani, 1996; Chen et al., 2001) estimator is given in (2). For consistent variable 
selection using = {k; $^ ^ 0}, it turns out that the design needs to satisfy the so-called 



11 



'neighbourhood stabihty' condition (Mcinshausen and Biihlmann, 2006) which is equivalent to the 
'irrepresentable condition' (Zhao and Yu, 2006; Zou, 2006; Yuan and Lin, 2007): 

max \sign{Psf{X^Xsr^X^Xk\ < 1. (12) 

The condition in (12) is sufficient and (almost) necessary (the word 'almost' refers to the fact that 
a necessary relation is using '<' instead of '<'). If this condition is violated, all one can hope for 
is recovery of the regression vector /? in an ^2-sense of convergence by achieving ||^^ — /3||2 — 
for n oo. The main assumption here arc bounds on the sparse eigenvalues as discussed below. 
This type of £2-convergence can be used to achieve consistent variable selection in a two-stage 
procedure by thresholding or, preferably, the adaptive Lasso (Zou, 2006; Huang et al., 2008). The 
disadvantage of such a two-step procedure is the need to choose several tuning parameters without 
proper guidance on how these parameters can be chosen in practice. We propose the randomised 
Lasso as an alternative. Despite its simplicity, it is consistent for variable selection even though 
the 'irrepresentable condition' in (12) is violated. 

Randomised Lasso is a new generalisation of the Lasso. While the Lasso penalises the absolute 
value \(3k\ of every component with a penalty term proportional to A, the randomised Lasso changes 
the penalty A to a randomly chosen value in the range [A, A/a]. 

Randomised Lasso with weakness a G (0, 1]: 

Let Wk be i.i.d. random variables in [a, 1] for k = 1,. . . ,p. The randomised 
Lasso estimator (3^'^ for regularisation parameter A G M is then 




A proposal for the distribution of the weights Wk is described below, just before Theorem 2. The 
word 'weakness' is borrowed from the terminology of weak greedy algorithms (Temlyakov, 2000) 
which are loosely related to our randomised Lasso. Implementation of (13) is straightforward by 
appropriate re-scaling of the predictor variables (with scale factor Wk for the A;-th variable). Using 
these re-scaled variables, the standard Lasso is solved, using for example the LARS algorithm (Efron 
et al., 2004) or fast coordinate wise approaches (Meier et al., 2008; Friedman et al., 2007). The 
perturbation of the penalty weights is reminiscent of the re-weighting in the adaptive Lasso (Zou, 
2006). Here, however, the re- weighting is not based on any previous estimate, but is simply chosen 
at random! As such, it is very simple to implement. However, it seems nonsensical at first sight 
since one can surely not expect any improvement from such a random perturbation. If applied only 
with one random perturbation, randomised Lasso is not very useful. However, applying randomised 
Lasso many times and looking for variables that are chosen often will turn out to be a very powerful 
procedure. 

Consistency for randomised Lasso with stability selection. For stability selection with 
randomised Lasso, we can do without the irrepresentable condition (12) but need only a condition 
on the sparse eigenvalues of the design (Candes and Tao, 2007; van de Geer, 2008; Meinshausen and 
Yu, 2009; Bickel et al., 2007), also called the sparse Riesz condition in Zhang and Huang (2008). 
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Definition 4 (Sparse Eigenvalues) For any C {1, . . . let Xk he the restriction of X to 
columns in K. The minimal sparse eigenvalue 0min is then defined for k < p as 

^r.Uk)= inf (14) 

ae'R^''^ ,KC{l,...,p}:\K\<\k'] \\a\\2 

and analogously for the maximal sparse eigenvalue (?!)max- 
We have to constrain sparse eigenvalues to succeed. 

Assumption 1 (Sparse eigenvalues) There exists some C > 1 and some k>9 such that 

This assumption (15) is related to the sparse Riesz condition in Zhang and Huang (2008). The 
equivalent condition there requires the existence of some C > such that 

0_((2 + 4C). + l)^^^ (16) 



(^,„in((2+4C)s + l) 



compare with Remark 2 in Zhang and Huang (2008). This assumption essentially requires that 
maximal and minimal eigenvalues, for a selection of order s variables, are bounded away from 
and oo respectively. In comparison, our assumption is significantly stronger than (16), but at the 
same time typically much weaker than the standard assumption of the 'irrepresentable condition' 
necessary to get results comparable to ours. 

We have not specified the exact form of perturbations we will be using for the randomised Lasso 
in (13). For the following, we consider the randomised Lasso of (13), where the weights Wk are 
sampled independently as Wk = oi with probability p^ € (0, 1) and Wk = 1 otherwise. Other 
perturbations are certainly possible and work often just as well in practice. 

Theorem 2 Consider the model in (10). For randomised Lasso, let the weakness a he given hy 
c? = u(j)iain{m)/m, for any v € {{7 /k)'^^ 1/^2), and m = Cs"^. Let an he a sequence with a„ ^ oo 
for n —>■ oo. Let Amin = 2a{\/2Cs + l)^/log{p V an)/n. Assume that p > 10 and s > 7 and 
that the sparse eigenvalue Assumption 1 is satisfied. Then there exists some 5 = 5s & (0, 1) such 
that for all nthr > 1 — S, stahility selection with the randomised Lasso satisfies on a set Qa with 
P{^a) > 1 — 5/(p V a„) that no noise variahles are selected, 

^^gstable ^ 0^ 

where S^"''''"'^ = {k : H^ > TTthr} with A > Amin- On the same set ^Ia, 

{S\Ssmall;x) C ^f"*-'^ (18) 

where Ssmali;\ = {k '■ \Pk\ < 0.3(Cs)^/^A}. This implies that all variahles with sufficiently large 
regression coefficient are selected. 
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Remark 2 Under the condition that the minimal non-zero regression coefficient is bounded from 
below by mink^s \Pk\ > (Cs)^/^(0.3A), as a consequence of Theorem 2, 

P(5 = 5f'^^'^)>l-l/(pVa„), 

i.e. consistent variable selection for an ^ oo (p ^ oo or n oo) in the sense of (11) even 
if the irrepresentable condition (12) is violated. If no such lower bound holds, the set of selected 
variables might miss variables with too small regression coefficients, which are, by definition, in the 

set Ssmall;\- 

Remark 3 Theorem 2 is valid for all A > Amin- This is noteworthy as it means that even if 
the value of X is chosen too large (i.e. considerably larger than Xmm)? no noise variables will be 
selected (formula (17)). Only some important variables might be missed. This effect has been seen 
in the empirical examples as stability selection is very insensitive to the choice of A. In contrast, 
a hard-thresholded solution of the Lasso with a value of A too large will lead to the inclusion of 
noise variables. Thus, stability selection with the randomised Lasso exhibits an important property 
of being conservative and guarding against false positive selections. 

Remark 4 Theorem 2 is derived under random perturbations of the weights. While this achieves 

good empirical results, it seems more advantageous in combination with with subsampling of the 
data. The results extend directly to this case. Let be the selection probability of variable k G 
S\Ssmall;\, while doing both random weight perturbations and subsampling n/2 out ofn observations. 
The probability that is above the threshold iTthr € (0, 1) is bounded by a Markov-type inequality 
from below by 



having used that -E(n^) > 1 — 5/(p V 0^/2) a consequence of Theorem 2. Ifb/{p\J an/2) suffi- 
ciently small in comparison to 1 — 'Kthrj this elementary inequality implies that important variables 
in S\Ssmall;X o,f^ still chosen by stability selection (subsampling and random weights perturbation) 

with very high probability. A similar argument shows that noise variables are also still not chosen 
with very high probability. Empirically, combining random weight perturbations with subsampling 
yields very competitive results and this is what we recommend to use. 

There is an inherent tradeoff when choosing the weakness a. A negative consequence of a low 
a is that the design can get closer to singularity and can thus lead to unfavourable conditioning 
of the weighted design matrix. On the other hand, a low value of a makes it less likely that 
irrelevant variables are selected. This is a surprising result but rests on the fact that irrelevant 
variables can only be chosen if the corresponding irrepresentable condition (12) is violated. By 
randomly perturbing the weights with a low a, this condition is bound to fail sometimes, lowering 
the selection probabilities for such variables. A low value of a will thus help stability selection to 
avoid selecting noise variables with a violated irrepresentable condition (12). In practice, choosing 
a in the range of (0.2, 0.8) gives very useful results. 

Relation to other work. In related and very interesting work. Bach (2008) has proposed 'Bo- 
lasso' (for bootstrapped enhanced Lasso) and shown that using a finite number of subsamples of the 
original Lasso procedure and applying basically stability selection with nthr = 1 yields consistent 
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Figure 4: The stability paths for randomised Lasso with stability selection using weakness pa- 
rameters a = 1 (left panel identical to the original Lasso) and a = 0.5 (middle) and a = 0.2 
(right). Red solid lines are the coefficients of the first two (relevant variables). The blue broken 
line is the coefficient of the third (irrelevant) variable and the dotted lines are the coefficients from 
all other (irrelevant) variables. Introducing the randomised version helps avoid choosing the third 
(irrelevant) predictor variable. 

variables selection under the condition that the penalty parameter A vanishes faster than typically 
assumed, at rate n~^^'^, and that the model dimension p is fixed. While the latter condition could 
possibly be technical only, the first distinguishes it from our results. Applying stability selection 
to randomised Lasso, no false variable is selected for all sufficiently large values of A, see Remark 
3. In other words, if A is chosen 'too large' with randomised Lasso, only truly relevant variable are 
chosen (though a few might be missed). If A is chosen too large with Bolasso, noise variables might 
be picked up. Figure 4 is a good illustration. Picking the regularisation in the left plot (without 
extra randomness) to select the correct model is much harder than in the right plot, where extra 
randomness is added. The same distinction can be made with two-stage procedures like adaptive 
Lasso (Zou, 2006) or hard-thresholding (Candes and Tao, 2007; Meinshausen and Yu, 2009), where 
variables are thresholded. Picking A too large (and A is notoriously difficult to choose), false vari- 
ables will invariably enter the model. In contrast, stability selection with randomised Lasso is not 
picking wrong variables if A is chosen too large. 

3.2 Example 

We illustrate the results on randomised Lasso with a small simulation example: p = n = 200 and 
the predictor variables are sampled from a AA(0, S) distribution, where S is the identity matrix, 
except for the entries S13 = S23 = p and their symmetrical counterparts. We use a regression 
vector P = (1, 1, 0, 0, . . . , 0). The response Y is obtained from the linear model Y = X(3 -|- e in (1), 
where ei, . . . ,£n i.i.d. AA(0, 1/4). For p > 0.5, the irrepresentable condition in (12) is violated and 
Lasso is not able to correctly identify the first two variables as the truly important ones, since it 
always includes the third variable superfluously as well. Using the randomised version for Lasso, 
the two relevant variables are still chosen with probability close to 1, while the irrelevant third 
variable is only chosen with much lower probability; the corresponding probabilities are shown 
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for randomised Lasso in Figure 4. This allows to separate relevant and irrelevant variables. And 
indeed, the randomised Lasso is consistent under stability selection. 

3.3 Randomised orthogonal Matching Pursuit 

An interesting alternative to Lasso or greedy forward search in this context are the recently proposed 
forward-backward search FOBA (Zhang, 2008) and the MC+ algorithm (Zhang, 2007), which both 
provably lead to consistent variable selection under weak conditions on sparse eigenvalues, despite 
being greedy solutions to non-convex optimisation problems. It will be very interesting to explore 
the effect of stability selection on these algorithms, but this is beyond the scope of this paper. 

Here, we look instead at orthogonal matching pursuit, a greedy forward search in the variable 
space. The iterative SIS procedure (Fan and Lv, 2008), entails orthogonal matching pursuit as a 
special case. We will examine the effect of stability selection under subsampling and additional 
randomisation. To have a clear definition of randomised orthogonal matching pursuit (ROMP), we 
define it as follows. 



Randomised orthogonal matching pursuit with weakness < a < 


1 and q 


iterations. 




1. Set Ri = Y. Set m = and S° = 0. 




2. For m= 1,. .. ,q: 




(a) Find pmax = maxi<fe<p \X'^Rm\ 




(b) Define K = {k : > ap^nax}- 




(c) Select randomly a variable kgei in the set K and set S"^ = 




{ksel}- 




(d) Let Rm+i = Y — PmY, where the projection Pm is j 


^iven by 






3. Return the selected sets 5^ C C . . . C 





A drawback of OMP is clearly that conditions for consistent variable selection are quite strong. 
Following Tropp (2004), the exact recovery condition for OMP is defined as 

max \\{X^Xs)-^X^Xk\\i < 1. (19) 

This is a sufficient condition for consistent variable selection. If it is not fulfilled, there exist 
regression coefficients that cause OMP or its weak variant to fail in recovery of the exact set S of 
relevant variables. Surprisingly, this condition is rather similar to the irrepresentable (Zhao and 
Yu, 2006) or neighbourhood stability condition (Meinshausen and Biihlmann, 2006). 

In the spirit of Theorem 2, we have also a proof that stability selection for randomised Orthog- 
onal Matching Pursuit (ROMP) is asymptotically consistent for variable selection in linear models, 
even if the right hand side in (19) is not bounded by 1 but instead by a possibly large constant 
(assuming the weakness a is low enough). This indicates that stability selection has a more general 
potential for improved structure estimation, beyond the case for the Lasso presented in Theorem 2. 
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Figure 5: Probability to select 0.1s and OAs important variables without selecting a noise variable 
with the Lasso in the regression setting (dark red bar) and stability selection under subsampling 
(light grey bar) for the 64 different settings. Black crosses mark the result for stability selection 
with additional randomisation (a = 0.5). 



It is noteworthy that our proof involves artificial adding of noise covariates. In practice, this seems 
to help often but a more involved discussion is beyond the scope of this paper. We will give empirical 
evidence for the usefulness of stability selection under subsampling and additional randomisation 
for orthogonal matching pursuit in the numerical examples below. 

4 Numerical Results 

To investigate further the effects of stability selection, we focus here on the application of sta- 
bility selection to Lasso and randomised Lasso for both regression and the natural extension to 
classification. The effect on OMP and randomised OMP will also be examined. 

For regression (Lasso and OMP), we generate observations by y = XP + e. For classification, 
we use the logistic linear model under the binomial family. To generate the design matrices X, we 
use two real and five simulated datasets, 

(A) Independent predictor variables. All p = 1000 predictor variables are i.i.d. standard normal 
distributed. Sample size n = 100 and n = 1000. 

(B) Block structure with 10 blocks. The p = 1000-dimensional predictor variable follows a AA(0, S) 
distribution, where = for all pairs {k, m) except if modiofc = modiom, for which 
Sfcm = 0.5. Sample size n = 200 and n = 1000. 

(C) Toeplitz design. The p = 1000-dimensional predictor variable follows a AA(0, S) distribution, 
where T,km = pl'^"™"' and p = 0.99. Sample size n = 200 and n = 1000. 
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Figure 6: The equivalent plot to Fig. 2 for Lasso applied to classification (top two rows) and OMP 
applied to regression (bottom two rows). 
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Figure 7: Comparison of stability selection with cross-validation for the real datasets (F) and 
(G). The cross-validated solution (for standard Lasso) is indicated by a dot and the corresponding 
stability selection (for randomised Lasso, a = 0.5 on the left and a = 1 on the right) by a red 
triangle, showing the average proportion of correctly identified relevant variables versus the average 
number of falsely selected variables. Each pair consisting of a dot and triangle corresponds to a 
simulation setting (some specified SNR and s). The broken vertical line indicates the value at which 
the number of wrongly selected variables is controlled, namely I^(V) < 2.5. Looking at stability 
selection, the proportion of correctly identified relevant variables is very close to the CV-solution, 
while the number of falsely selected variables is reduced dramatically. 



(D) Factor model with 2 factors. Let cpi, (/>2 be two latent variables following i.i.d. standard normal 
distributions. Each predictor variable Xk, for k = 1, . . . ,p, is generated as = fk,i4>i + 
/fc,202 + where fk^i, % have i.i.d. standard normal distributions for all k = 1, . . . ,p. 
Sample sizes are n = 200 and n = 1000, while p = 1000. 

(E) Identical to (D) but with 10 instead of 2 factors. 

(F) Motif regression dataset. A dataset (p = 660 and n = 2587) about finding transcription factor 
binding sites (motifs) in DNA sequences. The real-valued predictor variables are abundance 
scores for p candidate motifs (for each of the genes). Our dataset is from a heat-shock experi- 
ment with yeast. For a general description and motivation about motif regression we refer to 
Conlon et al. (2003). 

(G) The already mentioned vitamin gene expression data (with p = 4088 and n = 158) described 
in Section 2.2. 

We do not use the response values from the real datasets, however, as we need to know which 
variables are truly relevant or irrelevant. To this end, we create sparse regression vectors by 
setting = for all k = 1, . . . ,p, except for a randomly chosen set S of coefficients, where (3^ 
is chosen independently and uniformly in [0, 1] for all k £ S. The size s = IS*! of the active set is 
varied between 4 and 50, depending on the dataset. For regression, the noise vector (ei, . . . ,en) 
is chosen i.i.d. AA(0,cj^/n), where the rescaling of the variance with n is due to the rescaling of 
the predictor variables to unit norm, i.e. ||X('^)||2 = 1. The noise level cj^ is chosen to achieve 
signal-to-noise ratios (SNR) of 0.5 and 2. For classification, we scale the vector /? to achieve a 
given Bayes misclassification rate, either 1/8 or 1/3. Each of the 64 scenarios is run 100 times. 
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once using the standard procedure (Lasso or OMP), once using stability selection with subsampling 
and once using stability selection with subsampling and additional randomisation (q = 0.5 for the 
randomised Lasso and a = 0.9 for randomised OMP). The methods are thus in total evaluated on 
about 20.000 simulations each. 

The solution of stability selection cannot be reproduced by simply selecting the right penalty 
with Lasso, since stability selection provides a fundamentally new solution. To compare the power 
of both approaches, we look at the probability that 7 • s of the s relevant variables can be recovered 
without error, where 7 G {0.1,0.4}. A set of 7s variables is said to be recovered successfully 
for the Lasso or OMP selection, if there exists a regularisation parameter such that at least [7s] 
variables in S have a non-zero regression coefficient and all variables in = { 1 , . . . , p} \ S* have 
a zero regression coefficient. For stability selection, recovery without error means that the [7s] 
variables with highest selection probability max;^>;^^.^ /3j^ are all in S. The value Amin is chosen 
such that at most \/^&p variables are selected in the whole path of solutions for A > Amin- Note 
that this notion neglects the fact that the most advantageous regularisation parameter is selected 
here automatically for Lasso and OMP but not for stability selection. 

Results are shown in Figure 5 for Lasso apphed to regression, and in Figure 6 for Lasso apphed 
to classification and OMP applied to regression again. In Figure 5, we also give the median number 
of variables violating the irrepresentable condition (denoted by 'violations') and the average of the 
maximal correlation between a randomly chosen variable and all other variables ('max cor') as two 
measures of the difficulty of the problem. 

Stability selection identifies as many or more correct variables than the underlying method 
itself in all cases except for scenario (A), where it is about equivalent. That stability selection is 
not advantageous for scenario (A) is to be expected as the design is nearly orthogonal (very weak 
empirical correlations between variables), thus almost decomposing into p univariate decisions and 
we would not expect stability selection to help in a univariate framework. 

Often the gain of stability selection under subsampling is substantial, irrespective of the sparsity 
of the signal and the signal-to-noise-ratio. Additional randomisation helps in cases where there are 
many variables violating the irrepresentable condition; for example in setting (E). This is in line 
with our theory. 

Next, we test how well the error control of Theorem 1 holds up for these datasets. For the 
motif regression dataset (F) and the vitamin gene expression dataset (G), Lasso is applied, with 
randomisation and without. For both datasets, the signal-to-noise ratio is varied between 0.5, 1 and 
2. The number of non-zero coefficients s is varied in steps of 1 between 1 and 12, with a standard 
normal distribution for the randomly chosen non-zero coefficients. Each of the 72 settings is run 
20 times. We are interested in the comparison between the cross-validated solution and stability 
selection. For stability selection, we chose = and thresholds of -Kthr = 0.6, corresponding 

to a control of -E'(F) < 2.5, where V is the number of wrongly selected variables. The control 
is mathematically derived under the assumption of exchangeability for the distribution of noise 
variables, see Theorem 1. This assumption is most likely not fulfilled for the given dataset and it is 
of interest to see how well the bound holds up for real data. Results are shown in Figure 7. Stability 
selection reduces the number of falsely selected variables dramatically, while maintaining almost 
the same power to detect relevant variables. The number of falsely chosen variables is remarkably 
well controlled at the desired level, giving empirical evidence that the derived error control is useful 
beyond the discussed setting of exchangeability. Stability selection thus helps to select a useful 
amount of regularisation. 
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5 Discussion 



Stability selection addresses the notoriously difficult problem of structure estimation or variable 
selection, especially for high-dimensional problems. Cross-validation fails often for high-dimensional 
data, sometimes spectacularly. Stability selection is based on subsampling in combination with 
(high-dimensional) selection algorithms. The method is extremely general and we demonstrate its 
applicability for variable selection in regression and Gaussian graphical modelling. 

Stability selection provides finite sample familywise multiple testing error control (or control of 
other error rates of false discoveries) and hence a transparent principle to choose a proper amount of 
regularisation for structure estimation or variable selection. Furthermore, the solution of stability 
selection depends surprisingly little on the chosen initial regularisation. This is an additional great 
benefit besides error control. 

Another property of stability selection is the improvement over a prc-specified selection method. 
It is often the case that computationally efficient algorithms for high-dimensional selection are 
inconsistent, even in rather simple settings. We prove for randomised Lasso that stability selection 
will be variable selection consistent even if the necessary conditions needed for consistency of the 
original method are violated. And thus, stability selection will asymptotically select the right model 
in scenarios where Lasso fails. 

In short, stability selection is the marriage of subsampling and high-dimensional selection algo- 
rithms, yielding finite sample familywise error control and markedly improved structure estimation. 
Both of these main properties are demonstrated on simulated and real data. 

6 Appendix 

6.1 Sample splitting 

An alternative to subsampling is sample splitting. Instead of observing if a given variable is selected 
for a random subsample, one can look at a random split of the data into two non-overlapping 

samples of equal size [n /2j and see if the variable is chosen in both sets simultaneously. Let Ii and 
I2 be two random subsets of {1, . . . ,?7,} with = [n/2j for i = 1,2 and Ii H I2 = 0. Define the 
simultaneously selected set as the intersection of S^{Ii) and S^{l2), 

gsimult,\ ^ ^A(j^^ p gX(^j^y 

Definition 5 (Simultaneous selection probability) Define the simultaneous selection proba- 
bilities n for any set if C {1, . . . ,p} as 

Y^simult,\ P*(^K C (20) 

where the probability P* is with respect to the random sample splitting ( and any additional random- 
ness if S"^ is a randomised algorithm). 

We work with the selection probabilities based on subsampling but the following lemma lets us 
convert these probabilities easily into simultaneous selection probabilities based on sample sphtting; 
the latter is used for the proof of Theorem 1 . The bound is rather tight for selection probabilities 
close to 1. 
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Lemma 1 (Lower bound for simultaneous selection probabilities) For any set K C {l,...,p}, 

a lower bound for the simultaneous selection probabilities is given by, for every lo E fl, by 

f^sunult,X > 2n^-l. (21) 

Proof. Let Ii and I2 be the two random subsets in sample splitting of {1, . . . , n} with = [n/2j 
for i = l,2 and /in/2 = 0. Denote by sx({l, 1}) the probability P*{{K C S^{Ii)}n{K C S^ih)}). 
Note that the two events are not independent as the probability is only with respect to a random 
split of the fixed samples {1, . . . , n} into Ii and l2- The probabilities s;^({l, 0}), skUO, 1}), s_ft:({0, 0}) 
are defined cquivalcntly by P*{{K C S^{Ii)}n{K ^ S^ih)}), P*{{K ^ S^{h)}n{K C S^ih)}), 
and P*i{K ^ S^h)} n{K<^ S^ih)})- Note that n^™"^*'^ = ^^({l, 1}) and 

= sk{{1,0}) + sk{{1,1}) = SKi{0,l}) + SKi{l,l}) 
l-n^ = sk{{0,1}) + sk{{0,0}) = sk{{1,0}) + sk{{0,0}) 

It is obvious that sk{{1,0}) = sk{{0,1})- As sk{{0,0}) > 0, it also follows that sk{{1,0}) < 
1 - n^. Hence 

f^snnult,X ^ ^^^^^^ ^ fj^ _ 5^(|l^0}) > 2n^ - 1, 

which completes the proof. □ 



6.2 Proof of Theorem 1 

The proof uses mainly Lemma 2. We first show that P{k G S^) < qa/p for all k G N, using the 
made definitions = Ux^aS^ and = E{\S^\). Define furthermore Na = N n to be the 
set of noise variables (in N) which appear in and analogously Ua = S D S^. The expected 
number of falsely selected variables can be written as E{\Na\) = E{\S^\) — E{\Ua\) = Qa — ^(|^a|)- 
Using the assumption (8) (which asserts that the method is not worse than random guessing), it 
follows that £;(|C7a|) > E{\Na\)\S\/\N\. Putting together, (1 + |5|/|Af|)^(|iVA|) < ^a and hence 
I A^|~"^£'(|iVA|) < Qa/p- Using the exchangeability assumption, we have P{k G S^) = £"(1 A^a|)/|^| 
for allk ^ N and hence, for € A^, it holds that P[k G S^) < qa/p^ as desired. Note that this result 
is independent of the sample size used in the construction of 5*^, A G A. Now using Lemma 2 below, 
it follows that P(maxAeA > < {Qk/pf/i for all < ^ < 1 and A; G N. Using Lemma 1, 

it follows that P(maxAeAn^ > i^thr) < P((maxAeA n'^"*"'*'^ + l)/2 > TTthr) < {qA/p? /{'^i^thr - I)- 
Hence E{V) = ^^eN -^(maxAeA > TTthr) < Qa/ {pi^T^thr — 1)), which completes the proof. □ 

Lemma 2 Let K C {1, . . . ,p} and the set of selected variables based on a sample size of [n/2\ . 
IfP{KCS^) < e, then 

p^^simult,X £2^^_ 

IfP{K C UagA^'^) < £ for some A C M+, then 

P/ -rrsimulLX \ >\ ^ 2 

(maxH^ ' > f I < e E- 
^ AeA ^ - 1^ 

Proof. Let h^h C {1, . . . , n} be, as above, the random split of the samples {1, . . . , n} into two 
disjoint subsets, where both = \n/2\ for i = 1,2. Define the binary random variable for all 
subsets ii: C {1, . . . ,p} as i?^ := 1[K C {S'-^(/i)nS''^(/2)}}- Denote the data (the n samples) by Z. 
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The simultaneous selection probability n^™'*'*'-^^ ^s defined in (20), is then fj^™"'*'-^ = E*{H^) = 
E{H^\Z), where the expectation E* is with respect to the random split of the n samples into sets 
Ii and I2 (and additional randomness if S'^ is a randomised algorithm). To prove the first part, the 
inequality P{K C S^) < e (for a sample size [n/2j), implies that P{H^ = 1) < P{K C S^{h))'^ < 
and hence E{H^) < e^. Therefore, E{H^) = E{E{H^\Z)) = ^(n^™"^*'^) < ^2 Using a Markov- 
type inequality, ^p(n^™««'^ > ^) < ^(n^™«tt,A) < ^2^ ^j^^^ p^^sM,x > ^ ^2/^^ completing 
the proof of the first claim. The proof of the second part follows analogously. □ 



6.3 Proof of Theorem 2 

Instead of working directly with form (13) of the randomised Lasso estimator, we consider the 
equivalent formulation of the standard Lasso estimator, where all variables have initially unit norm 
and are then rescaled by their random weights W. 

Definition 6 (Additional notation) For weights W as in (13), let X"^ he the matrix of re- scaled 
variables, with X'^ = X^ ■ Wk for each k = 1, . . . ,p. Let ^^ax ^'^^ <?^min maximal and minimal 

eigenvalues analogous to (I4) for X'^ instead of X. 

The proof rests mainly on the two-fold effect a weakness a < 1 has on the selection properties of 
the Lasso. The first effect is that the singular values of the design can be distorted if working with 
the reweighted variables X^ instead of X itself. A bound on the ratio between largest and smallest 
eigenvalue is derived in Lemma 3, effectively yielding a lower bound for useful values of a. The 
following Lemma 4 then asserts, for such values of a, that the relevant variables in S are chosen 
with high probability under any random sampling of the weights. The next Lemma 5 establishes 
the key advantage of randomised Lasso as it shows that the 'irrepresentable condition' (12) is 
sometimes fulfilled under randomly sampled weights, even though its not fulfilled for the original 
data. Variables which are wrongly chosen because condition (12) is not satisfied for the original 
unweighted data will thus not be selected by stability selection. The final result is established in 
Lemma 7 after a bound on the noise contribution in Lemma 6. 

Lemma 3 Define C by {2 + 4C)s -|- 1 = Cs^ and assume s > 7. Let W be weights generated 

randomly in [a,l], as in (13), and let X'^ be the corresponding rescaled predictor variables, as 
in Definition 6. For c? > y(j)ui\n{Cs^)/{Cs'^), with v G M"*", it holds under Assumption 1 for all 
random realisations W that _ 



ma x V 
r min 



< TTTt:- (22) 



Proof Using Assumption 1, 

„2 



where the first inequality follows by Assumption 1, the equality by (2 + 4C)s + 1 = Cs^ and the 
second inequality by s > 1. It follows that 



'^min 



(Cs2) - K V Cs2 



Vmax 

(Cs^) ^ 3 + 4C /0min(Cs2) 

^ — z — V — 7^ — ■ y^'^J 
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Now, let W be again the p x p-diagonal matrix with diagonal entries Whk = for all k = 1, . . . ,p 
and on the non-diagonal elements. Then = XW and, taking suprema over all W with 
diagonal entries in [a, 1], 

(CaxM)' < sup sup (||X-^;||2/||t;||2)' 

W v£RP:\\v\\()<m 

= sup sup {v'^W'^X^XWv)/v'^v < (0m£«(m))2, 

W vm.P:\\v\\o<m 

where the last step follows by a change of variable transform v = Wv and the fact that \\v\\o = 
||W^;||o as well as v'^v = v^W^^'^W^^v and thus v'^v < v^v < a~'^v'^v for all W with diagonal 
entries in [a, 1]. The corresponding argument for ^mm("^) yields the bound ^mm(™') — (^4'mm{m) for 
all m G N. The claim (22) follows by observing that C > 1 for s > 7, since C > 1 by Assumption 1 
and hence 3 + AC <7C. □ 

Lemma 4 Let A'^'^'^ be the set {k : (5^'^ ^ 0} of selected variables of the randomised Lasso 
with weakness a G (0, 1] and randomly sampled weights W. Suppose that the weakness > 
(7/K)^(/)inin(Cs^)/(C'5^). Under the assumptions of Theorem 2, there exists a set in the sample 
space of Y with P{Y € flo) > 1 — 3/(p V a„), such that for all realisations W = w, for p > b, if 
Y eno, 

US\< Cs'' and {S \ Ssmamx) C i^'^, (24) 
where SsmaU;\ is defined as in Theorem 2. 

Proof. Follows mostly from Theorem 1 in Zhang and Huang (2008). To this end, set cq = in 
their notation. We also have Cs^ < (2 + AC)s + 1, as, by definition, (2 + 4C)s + 1 = Cs^, as 
in Lemma 3. The quantity C = c*/c* in Zhang and Huang (2008) is identical to our notation 
</'max(C's^)/<Amin(^*^)- boundcd for all random realisations oi W = w, as long as > 

(7/«;)^^min(C's^)/(Cs^), using Lemma 3, by 



Cax((2 + 4C),S + 1) ^- 

Ci„((2 + 4C)s + l) - 

Hence all assumptions of Theorem 1 in Zhang and Huang (2008) are fulfilled, with r/i = 0, for any 
random realisation W = w. Using (2.20)-(2.24) in Zhang and Huang (2008), it follows that there 
exists a set in the sample space of Y with P{Y G Oq) > 2 — exp(2/(p V a„)) — 2/{pV a^)^ > 
1 - 3/{p V an) for all p > 5, such that if F G J^o, from (2.21) in Zhang and Huang (2008), 

li^'"" US\<{2 + 4C)s < Cs^, (25) 

and, from (2.23) in Zhang and Huang (2008), 

V \l3k\^l{k ^ i^'""} < dc + + ^C^)sA2 < 5.QC%^X^ < (0.3 {Csf/^Xf, (26) 
k€S 3 9 9 

having used for the first inequality that, in the notation of Zhang and Huang (2008), l/(c*c*) < 
c*/c*. The factor was omitted to account for our different normalisation. For the second 
inequality, we used AC < Cs. The last inequality implies, by definition of Ssmall;X in Theorem 2, 
that S \ 

Ssm.all;\ ^ A^'^, which Completes the proof. D 
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Lemma 5 Set m = Cs^ . Let fc G {1, . . . ,p} and let K{w) C {1, . . . ,p} 6e a set which can depend 
on the random weight vector W. Suppose that K{w) satisfies \K{w)\ < m and k ^ K{w) for all 
realisations W = w. Suppose furthermore that K{w) = A for some A C {1, . . . ,p} implies that 
K{v) = A for all pairs w^v of weights that fulfill vj < wj for all j G {1, . . . ,p}, with equality 

for all j e A. Then, for < 0^i„(m)/(-\/2m), 

Mmx^MfX^^^^)-\X^^^)fX]:\U < 2-1/4) >^^(i _^^)-. (27) 

where the probability Pw is with respect to random sampling of the weights W and pw is, as above, 
the probability of choosing weight a for each variable and 1—pw the probability of choosing weight 1. 

Proof. Let w be the reahsation of W for which = a and Wj = 1 for all other j G {1, . . . ,p} \ fc. 
The probability of = iZ; is clearly Pw{^—PwY~^ under the used sampling scheme for the weights. 
Let A := K{w) be the selected set of variables under these weights. Let now W C be the 

set of all weights for which wt = ct and Wj = 1 for all j G A, and arbitrary values in {a, 1} for all Wj 
with j ^ A\Jk. The probability for a random weight being in this set is Pw{w G W) = Pwi^^Pw)^'^^- 
By the assumption on K, it holds that K{w) = A for all w G W, since Wj < Wj for all j G {1,. . . ,p} 
with equality for j E A. For all weights ii; G W, it follows moreover that 

((X;^)^Xl)-HX^)^Xr = a{XlXA)-'xlX,. 

Using the bound on a, it hence only remains to be shown that, if ||X/||2 = 1 for all / G {1, . . . ,p}, 

sup snp \\{XlXA)-^X^Xk\\l<m/(Pmin{m). (28) 

A:|A|<m k^A 

Since ||7||i < iyp4|||7||2 for any vector 7 G mI"^I, it is sufficient to show, for 7 := {X'^Xa)~^X'^X]^, 

sup sup II7III < l/(f)„im{m). 

A:\A\<m k^A 

As Xa'J is the projection of X^ into the space spanned by Xa and ||^A;||2 = 1, it holds that 

||X^7||i < L Using IIX57III = 7^(^I^a)7 > <^min(|^|) hill, it follows that ||7||i < l/</.n,in(|A|), 

which shows (28) and thus completes the proof. □ 

Lemma 6 Let Pa = X A{X'j^X a)^^ X^ be the projection into the space spanned by all variables in 
subset A C {1, . . . Suppose p > 10. Then there exists a set fli with P{^i) > 1 — 2/(j3 V an), 
such that for all u E fti, 

sup sup|Xj'(l - Pa)£\ < 2a{V2m + l)y/log{p W an) /n. (29) 

A:|A|<m k^A 

Proof. Let be the event that max^gji p} \X'j^e\ < 2a^/log{p V an)/n. As entries in e are i.i. 
A/'(0,a2) distributed, P{n[) > 1 - l/{p V an) for all S G (0, 1). Note that, for all Ac {1,... ,p} 
dJidk^A, \X'^Pa£\ < \\Pa£\\2- Define O'/ as 

sup ||Pa£||2 < 2a^/2mlog{pW an)/n. (30) 

|A|<m 

It is now sufficient to show that P{'^i) > 1 — 1 /{p V a„). Showing this bound is related to a bound 
in Zhang and Huang (2008) and we repeat a similar argument. Each term y^||P^£||2/(7 has a X\a\ 
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distribution as long as is of full rank \A\. Hence, using the same standard tail bound as in the 
proof of Theorem 3 of Zhang and Huang (2008), 

p(n\\PAe\\l/a'' > |A|(1 + 41og(p V a„))) < ((j. V a„)-4(l + 41og(p V a„)))l-^l/2 < (;,Va„)-W, 
having used l+41og(pVa„) < (pyan) for all p > 10 in the last step and thus, using (|^|) < /\A\\, 

Pm>l - ^ ( M(pVa„)-W>l - 5^(pVa„)-l^l/V(|A|)!>l-l/(pVa„), 

\A\=2 ^' \A\=2 

which completes the proof by setting fli = Q[ n O'/ and concluding that P{^i) > 1 — 2/(p V a„) 
for all p > 10. □ 

Lemma 7 Let Sw = Pw{^ — Pw)^^^ and H^ = Pw{k G A^^^) he again the probability for variable 

k of being in the selected subset, with respect to random sampling of the weights W . Then, under 
the assumptions of Theorem 2, for all k ^ S and p > 10, there exists a set with P{^a) > 
1 — 5/(p V Qn) such that for all uj G Q^a and A > Amin; 



max Hi < l-5yj (31) 

keN k w V y 



^ min H^ > 1 - <5^, (32) 
where Ssmall;X defined as in Theorem 2. 

Proof Wc let Qa = f^o ^ Oi, where is the event defined in Lemma 4 and event f^i is defined in 
Lemma 6. Since, using these two lemmas, 

Pino n J^i) > 1 - P(n^) - P{ni) > l - 3/{p V a„) - 2/{p V a„) = l - 5/{p V an), 

it is sufficient to show (31) and (32) for all w G H Oi. We begin with (31). A variable k ^ S is 
in the selected set A^'^ only if 

\{X;:f{Y-X^,p'''^'-')\>X, (33) 

where is the solution to (13) with the constraint that = 0, comparable to the 

analysis in Meinshausen and Biihlmann (2006). Let A^'^'~^ := {j : j3^'^~'^ ^ 0} be the set of 
non-zero coefficients and B'^'^^^'' := U be the set of regression coefficients which are 

either truly non-zero or estimated as non-zero (or both). We will use as a short-hand notation 
for B^'^'"^. Let PV be the projection operator into the space spanned by all variables in the set 

B. For allW = w, this is identical to 

P| = x|((x|)^x|)-ix| = x^ixTx^r'x^ = P^. 

Then, splitting the term {X]^)'^{Y - X'^^^P^'^-'') in (33) into the two terms 

(X^fil - PV){Y - X-,/3^''^'-'^) + {X]:fP^{Y - X-,/3^''^'-'=), (34) 
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it holds for the right term in (34) that 

< \\{{x^fx^)-\x^fxniX. 

Looking at the left term in (34), since Y G ^Iq, we know by Lemma 4 that \B\ < Cs^ and, by 
definition of B above, S C B. Thus the left term in (34) is bounded from above by 

{X]: f{l - P|)e < sup sup \{Xkf{l - Ps)e\ ■ \\X]:\\2/\\Xk\\2 

A:\A\<Cs^ k^A 

■^min ii^ni2/ii^fcii2, 

having used Lemma 6 in the last step and Amin = 2a{V2Cs + l)^/log{p V an)/n. Putting together, 
the two terms in (34) are bounded, for all a; G H Oi, by 

A„.i„iix,-ii2/iix,ii2 + \\{{xvfx^r\x^fx]:\\ix. 

We now apply Lemma 5 to the rightmost term. The set S is a function of the weight vector and 
satisfies for every realisation of the observations Y e the conditions in Lemma 5 on the set 
K{w). First, \B\ < Cs^. Second, by definition of B above, k ^ B for all weights w. Third, it 
follows by the KKT conditions for Lasso that the set of non-zero coefficients of f3-^'^~^ and j3^''"~'^ 
is identical for two weight vectors w and v, as long Vj = wj for all j G A'^'^'^'^ and Vj < wj for all 
j ^ A.\w,-k (increasing the penalty on zero coefficients will leave them at zero, if the penalty for 
non-zero coefficients is kept constant). Hence there exists a set in the sample space of W with 
Pwi^w) >1-S^ such that \\{{Xp^Xp-^{Xp'^X]^\\i < 2-V4. Moreover, for the same set O^, 
we have ||X^||2/||Xfe||2 = a < 1/s < 1/7. Hence, for all a; G fio H Oi and, for all lj G flw, the Ihs of 
(33) is bounded from above by Ainin/7 + A2'~^/^ < A and variable k ^ S is hence not part of the set 
A^'^. It follows that maxAeAHfc < 1 — Sw with Sw = Pwi^ — Vw)'^^ for all A; ^ S. This completes 
the first part (31) of the proof. 

For the second part (32), we need to show that, for all oj G fio H r^i, all variables k va. S are 
chosen with probability at least 1 — (5^ (with respect to random sampling of the weights W) , except 
possibly for variables in Ssmaii;X ^ 5*, defined in Theorem 2. For all uj £ Qq, however, it follows 
directly from Lemma 4 that (5 \ Ssm,aii,x) ^ A"^'^ . Hence, for all A; G S' \ Ssmall;\j the selection 
probability satisfies fl^ = 1 for all Y & ^Iq, which completes the proof. □ 

Since the statement in Lemma 7 is a reformulation of the assertion of Theorem 2, the proof of 
the latter is complete. 
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